Effects of label noise on the classification of outlier observations
de Farias, Matheus Vinícius Barreto, de Castro, Mario
The following study presents results from experiments in which, before training a classification model, we add noise to the labels of the training set, so that the information contained in this set is not entirely correct. In practice, most datasets contain some degree of label noise, which makes this type of study important for evaluating new techniques before deploying them in real-world applications. Here, we are interested in measuring the impact of label noise on BCOPS (Guan & Tibshirani, 2022), an algorithm based on conformal prediction (Vovk et al., 2005) which, when combined with other machine learning methods, allows the construction of prediction sets for the test observations in classification tasks. Prediction sets contain the possible values (for regression tasks) or possible classes (for classification tasks) for new observations, and are constructed so that the probability of the true value or class being contained within them meets a coverage guarantee. Guan & Tibshirani (2022) emphasize the possibility of using these prediction sets to detect outlier observations, meaning observations whose true class was not present during training. We therefore aim to measure both the classification coverage and the abstention rate on outlier observations of the BCOPS algorithm under label noise, considering some of the datasets and machine learning algorithms used by Guan & Tibshirani (2022).
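The prediction-set mechanism described above can be illustrated with a minimal split-conformal sketch. This is not the BCOPS procedure itself; the score layout, the per-class quantile rule, and the function name below are illustrative assumptions. The idea is that a per-class score threshold is calibrated on held-out labeled data, a test point's prediction set collects every class it conforms to, and an empty set is read as abstention on a potential outlier.

```python
import numpy as np

def conformal_prediction_sets(cal_scores, cal_labels, test_scores, alpha=0.1):
    """Split-conformal prediction sets for classification (illustrative sketch,
    not the BCOPS algorithm). `cal_scores` and `test_scores` are arrays of
    shape (n, n_classes) holding a classifier's per-class scores."""
    n_classes = cal_scores.shape[1]
    thresholds = np.empty(n_classes)
    for k in range(n_classes):
        # Scores of calibration points whose true label is k.
        s_k = cal_scores[cal_labels == k, k]
        # A test point conforms to class k if its class-k score is not
        # unusually low compared with the class-k calibration scores.
        thresholds[k] = np.quantile(s_k, alpha)
    pred_sets = []
    for s in test_scores:
        # Include every class whose threshold the test score reaches;
        # an empty set flags a potential outlier (abstention).
        pred_sets.append({k for k in range(n_classes) if s[k] >= thresholds[k]})
    return pred_sets
```

Under exchangeability, a set built this way contains the true class with probability roughly 1 - alpha, which is the coverage guarantee the abstract refers to.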
A Mathematical Optimization Approach to Multisphere Support Vector Data Description
Blanco, Víctor, Espejo, Inmaculada, Páez, Raúl, Rodríguez-Chía, Antonio M.
We present a novel mathematical optimization framework for outlier detection in multimodal datasets, extending Support Vector Data Description (SVDD) approaches. We provide a primal formulation, in the form of a Mixed Integer Second Order Cone model, that constructs Euclidean hyperspheres to identify anomalous observations. Building on this, we develop a dual model that enables the application of the kernel trick, thus allowing the detection of outliers within complex, non-linear data structures. An extensive computational study demonstrates the effectiveness of our exact method, showing clear advantages over existing heuristic techniques in terms of accuracy and robustness.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
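The multisphere idea can be sketched heuristically: cover the data with several hyperspheres and flag points that fall outside all of them. The sketch below is only an illustration, not the paper's exact MISOCP formulation; the fixed centers and the `coverage` quantile rule are assumptions standing in for the optimization model.

```python
import numpy as np

def multisphere_outliers(X, centers, coverage=0.95):
    """Heuristic sketch of multisphere outlier detection: assign each point to
    its nearest center, set each sphere's radius so that it covers `coverage`
    of its assigned points, and flag points lying outside every sphere.
    (The paper solves this exactly via a MISOCP; this is an illustration.)"""
    # Pairwise Euclidean distances, shape (n_points, n_centers).
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    radii = np.array([np.quantile(d[nearest == j, j], coverage)
                      for j in range(len(centers))])
    # A point is an outlier iff it lies outside all hyperspheres.
    return (d > radii[None, :]).all(axis=1)
```

Using several spheres instead of one is what handles multimodality: a single enclosing sphere around two well-separated clusters would also cover the empty region between them, masking outliers that sit there.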
Anomaly/Outlier Detection using Local Outlier Factors - DataScienceCentral.com
Outliers are patterns in data that do not conform to the expected behavior. Detecting such patterns is of prime importance in areas such as credit card fraud and stock trading, and detecting anomalous or outlier observations also matters when training any supervised machine learning model. This brings us to two very important questions: what is a local outlier, and why do we need one? In a multivariate dataset whose rows are generated independently from a probability distribution, the centroid of the data alone might not be sufficient to tag all the outliers. Measures like the Mahalanobis distance can identify extreme observations but won't be able to label all possible outlier observations.
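The local-density idea behind the Local Outlier Factor can be sketched in plain NumPy. This is an illustrative implementation (the choice of `k` and the data are assumptions, not taken from the article): a point's LOF compares its local reachability density with that of its neighbours, so scores well above 1 mark points that are sparse relative to their own neighbourhood even if they are not globally extreme.

```python
import numpy as np

def local_outlier_factor(X, k=5):
    """Plain-NumPy sketch of the Local Outlier Factor (LOF) score.
    Scores near 1 indicate inliers; scores well above 1 indicate points
    whose local density is much lower than that of their neighbours."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # exclude self from neighbours
    knn = np.argsort(d, axis=1)[:, :k]        # indices of k nearest neighbours
    k_dist = d[np.arange(n), knn[:, -1]]      # distance to k-th neighbour
    # Reachability distance: reach(p, o) = max(k_dist(o), d(p, o)).
    reach = np.maximum(k_dist[knn], d[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)            # local reachability density
    return lrd[knn].mean(axis=1) / lrd        # LOF score per point
```

Unlike the Mahalanobis distance, which measures deviation from a single global centre and covariance, this score is relative to each point's neighbourhood, which is why it can catch outliers near dense clusters that global measures miss.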